Predicting early mortality and severe intraventricular hemorrhage in very-low birth weight preterm infants: a nationwide, multicenter study using machine learning

Our aim was to develop a machine learning-based predictor for early mortality and severe intraventricular hemorrhage (IVH) in very-low birth weight (VLBW) preterm infants in Taiwan. We collected retrospective data from VLBW infants, dividing them into two cohorts: one for model development and internal validation (Cohort 1, 2016–2021), and another for external validation (Cohort 2, 2022). Primary outcomes included early mortality, severe IVH, and early poor outcomes (a combination of both). Data preprocessing involved 23 variables, with the top four predictors identified as gestational age, birth body weight, 5-min Apgar score, and endotracheal tube ventilation. Six machine learning algorithms were employed. Among 7471 infants analyzed, the selected predictors consistently performed well across all outcomes. Logistic regression and neural network models showed the highest predictive performance (AUC 0.81–0.90 in both internal and external validation) and were well-calibrated, confirmed by calibration plots and the lowest two mean Brier scores (0.0685 and 0.0691). We developed a robust machine learning-based outcome predictor using only four accessible variables, offering valuable prognostic information for parents and aiding healthcare providers in decision-making.


Outcomes
The primary outcomes of the study were early mortality, severe IVH, and early poor outcomes (early mortality or severe IVH). Early mortality was defined as death within the first week of life, and severe IVH was defined as IVH grade III or IV on cranial ultrasound, graded using Volpe's grading system 11.

Data preprocessing
We collected essential data as variables for each enrolled infant, resulting in a total of 23 variables. These variables included the following: antenatal steroid use; prenatal magnesium sulphate (MgSO4) use; pregnancy-induced hypertension; chorioamnionitis; GA; BBW; multiple births; Cesarean section; small for GA (defined as birth weight below the 10th percentile for GA, referencing values for birth weight distributions from a previous study of the Taiwanese population) 12; sex; 1-min Apgar score; 5-min Apgar score; body temperature (defined as the first rectal temperature measured within the first hour of birth); early-onset sepsis (defined as culture-proven sepsis occurring within 72 h of birth); respiratory distress syndrome; congenital anomalies (including chromosomal anomalies, skeletal dysplasia, inborn errors of metabolism, lethal or life-threatening anomalies in the cardiovascular, gastrointestinal, genitourinary, or pulmonary system, and other lethal or life-threatening anomalies); and seven delivery room resuscitation measures: neonatal resuscitation, oxygen supplementation, delivery room continuous positive airway pressure ventilation, positive pressure ventilation, endotracheal tube ventilation, chest compressions, and epinephrine administration. RapidMiner software version 10.0 (Altair Engineering, Troy, MI, USA; www.rapidminer.com) was used for data input and the cleaning of missing data.

Selection of variables
To facilitate practical applicability, we conducted variable selection using the information gain attribute evaluator provided by Weka software version 3.8.6 (Waikato Environment for Knowledge Analysis, Hamilton, New Zealand). The evaluator measures the reduction in entropy of each outcome that is attributable to a given variable, and we used it to rank the importance of each of the 23 variables 13. Additionally, we evaluated collinearity between the variables. In the interest of establishing a more streamlined model, we retained only the top-ranked variables.
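The ranking idea can be illustrated with a minimal sketch of information gain for a discrete variable. This is not the Weka implementation (which also discretizes continuous attributes); the function names and the toy data below are hypothetical and chosen only to show the calculation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Outcome entropy minus the expected outcome entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Made-up toy data (NOT from the study): 1 = event, 0 = no event.
toy_outcome = [1, 1, 0, 0, 0, 0, 0, 0]
toy_intubated = [1, 1, 1, 0, 0, 0, 0, 0]  # informative about the outcome
toy_sex = [0, 1, 0, 1, 0, 1, 0, 1]        # uninformative about the outcome

# Rank the toy variables from most to least informative.
toy_ranking = sorted(
    {"intubated": toy_intubated, "sex": toy_sex}.items(),
    key=lambda item: information_gain(item[1], toy_outcome),
    reverse=True,
)
```

A variable that splits the infants into groups with very different outcome rates yields a large gain, whereas a variable whose groups have the same outcome rate yields a gain of zero.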

ML algorithms and model building
The flow chart for building models using ML algorithms via Orange software version 3.34.0 (University of Ljubljana, Ljubljana, Slovenia) 14 is shown in Fig. 1.
These models were developed using six algorithms: k-nearest neighbor (kNN), decision tree, random forest, neural network, logistic regression, and gradient boosting.
Brief descriptions of the six ML models are as follows:
• The kNN algorithm 15 is an instance-based ML model that stores all instances of the training dataset and makes predictions based on neighborhood proximity, as defined by a similarity metric.
• The decision tree algorithm 16 is a tree-structured prediction model that starts with a root node and progresses to a leaf node. Each internal node represents a predictor variable, each connection from an internal node represents a choice, and each leaf node represents the outcome variable.
• The random forest algorithm 17 is an ensemble ML model that combines multiple decision trees to achieve increased prediction accuracy. Each uncorrelated decision tree in the random forest makes a prediction, and the prediction with the largest number of votes is used as the final prediction of the algorithm.
• The neural network algorithm 18 is an ML model that mimics signal transmission through neurons in the human brain. The algorithm comprises multiple layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node functions as a neuron with a threshold value. If the collected signal reaches this threshold, the node is activated and the signal is transferred to the next layer in the network. Signals propagate in this way until they reach the output layer, which produces the prediction.
• The logistic regression algorithm 19 is used for binary and multiclass classification. It applies the sigmoid (logistic) function to a linear combination of the predictors to provide an estimated probability ranging from zero to one.
• The gradient boosting algorithm 20 is another ensemble model that combines a large number of simple ML models into a strong predictor. The algorithm first trains a simple base learner on the full training dataset and calculates its residual error. A new learner is then trained to predict that residual error, and the process is repeated to increase the accuracy of the prediction model.
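As one concrete instance of the descriptions above, the kNN rule can be sketched in a few lines. This is an illustrative sketch only, not the Orange implementation; the Euclidean distance, the value k = 3, and the toy data are assumptions made for the example.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance (assumed metric)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up toy data (NOT from the study): (GA in weeks, 5-min Apgar score) -> outcome.
toy_points = [(24, 3), (25, 4), (24, 5), (30, 8), (31, 9), (29, 8)]
toy_labels = [1, 1, 1, 0, 0, 0]
```

In practice the predictors would usually be rescaled before computing distances, so that a variable with a wide numeric range does not dominate the similarity metric.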

Internal evaluation
A tenfold cross-validation approach was employed for internal model validation. The dataset was randomly divided into 10 groups, with nine groups used for training and one for testing in each iteration. The average performance of the test results was subsequently used to assess the overall performance of the model across all the groups.
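The splitting scheme described above can be sketched as follows. This is a minimal illustration, not the Orange procedure: whether the folds were stratified by outcome is not stated here, so plain random folds are assumed, and `score_fold` is a hypothetical callback that trains on the training indices and returns a score on the test indices.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # random assignment to folds
    folds = [indices[i::k] for i in range(k)]    # k folds of near-equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, score_fold, k=10, seed=0):
    """Average a per-fold score over all k held-out folds."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Each infant appears in exactly one test fold, so every record contributes once to the averaged test performance.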

Model comparison
The performance of all prediction models was assessed by comparing the area under the curve (AUC) using the Orange software. Additionally, calibration plots and mean Brier scores, calculated using Python, were employed to evaluate the predictive ability and goodness of fit of the models. This facilitated the observation of agreement between the actual and predicted probabilities.
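For reference, both metrics can be computed directly from their definitions. The sketch below is not the authors' code; the function names and toy values are hypothetical. The Brier score is the mean squared difference between predicted probability and observed outcome (lower is better), and the AUC is the probability that a randomly chosen infant with the event receives a higher predicted probability than a randomly chosen infant without it.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy predictions (NOT from the study).
toy_y = [1, 0, 0, 1]
toy_p = [0.9, 0.1, 0.4, 0.3]
```

A mean Brier score across several outcomes, as reported in this study, would then be the average of the per-outcome scores.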

External validation
The predictive models that exhibited outstanding performance, developed using the Cohort 1 dataset, were subsequently applied to the Cohort 2 dataset for external validation. The AUCs were computed to assess their performance in this independent dataset.

Equation development
The intercepts and coefficients for the selected attributes across different outcomes were calculated using Orange software. Subsequently, we formulated the corresponding equations and developed estimators to predict the probabilities of various target outcomes.

Study population and patient characteristics
A total of 8531 newborns were enrolled during the study period. However, 711 newborns were excluded due to missing data and 349 were excluded because they died within 12 h of delivery. Consequently, 7471 newborns with complete records were included in the final study. Cohorts 1 and 2 included 6558 and 913 infants, respectively.
In Cohort 1 (Table 1), there was a significant difference (p < 0.05) between each variable and target outcome, except for: the use of prenatal MgSO4 between the groups with and without severe IVH (p = 0.157); multiple births, across all outcomes (p = 0.671 in early mortality, p = 0.32 in severe IVH, and p = 0.22 in early poor outcomes); and congenital anomalies between the groups with and without severe IVH (p = 0.76).

Selection of predictors
Attribute selection, based on the Weka information gain attribute evaluator, enabled a condensed and generic application of the prediction models. The values generated by the evaluator for each variable are listed in Fig. 2 and Supplementary Table S1 and reveal notable distinctions between the top five ranked variables and those ranked sixth and beyond. Furthermore, variables ranked second to fifth exhibited similar scores. Consequently, the initial selection for model development included the top five variables: gestational age (GA), birth body weight (BBW), 1-min Apgar score, 5-min Apgar score, and endotracheal tube ventilation during initial resuscitation.
Additionally, considering collinearity concerns, further analysis was conducted using Variance Inflation Factor (VIF) values 21, as presented in Supplementary Table S2. This analysis indicated significant collinearity between the 1-min Apgar score and the 5-min Apgar score. Based on prior research 22, the 5-min Apgar score is regarded as a more reliable predictor of neonatal outcomes than the 1-min Apgar score. Therefore, we opted to exclude the 1-min Apgar score from our prediction variables during model development.
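To make the collinearity check concrete: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below covers only the simplified case of two predictors, where R² equals the squared Pearson correlation; it is not the authors' analysis, the toy scores are made up, and the threshold used to call collinearity "significant" is not stated in this section.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def vif_two_predictors(x, y):
    """VIF when only two predictors are considered: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Made-up toy scores (NOT from the study) that rise and fall together.
toy_apgar1 = [2, 4, 5, 7, 8]
toy_apgar5 = [4, 6, 7, 8, 9]
```

A VIF of 1 indicates no linear relationship with the other predictor, and the value grows without bound as the two predictors approach perfect correlation.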

Model development and comparison
The four most important variables, which were top-ranked and showed no significant collinearity, were used to develop prediction models in Orange software. The internally validated receiver operating characteristic (ROC) curve results (Fig. 3) indicated that the neural network, logistic regression, and gradient boosting models were the best-performing predictive models for all target outcomes, with AUC values of 0.87, 0.86, and 0.86, respectively, for the prediction of early mortality; 0.82, 0.82, and 0.81, respectively, for severe IVH; and 0.84, 0.84, and 0.83, respectively, for early poor outcomes.
Calibration plots illustrate the agreement between predicted probabilities and observed frequencies across percentiles of predicted values. According to Fig. 4, the logistic regression and neural network models demonstrated superior calibration performance. Furthermore, the logistic regression model achieved the best mean Brier score across the three predictive outcomes, with a score of 0.0685, followed closely by the neural network model, which attained the second-lowest mean Brier score of 0.06906. In contrast, the kNN and decision tree models exhibited less favorable calibration performance, with the highest mean Brier scores recorded at 0.0811 and 0.08123, respectively.
For external validation in Cohort 2, we used the best-performing prediction models, namely the logistic regression and neural network models. The results of the ROC curve analysis (Fig. 5) indicated strong predictive performance across all outcomes. Specifically, the AUC values for the logistic regression and neural network models were 0.90 and 0.89, respectively, for early mortality prediction; 0.84 and 0.83, respectively, for severe IVH prediction; and 0.86 and 0.84, respectively, for early poor outcome prediction.

Equation development
We used Orange software to calculate the intercepts and coefficients necessary for constructing the prediction models through logistic regression. The results are summarized in Table 2. An equation was formulated for each target outcome, and outcome estimators suitable for clinical applications were developed using Microsoft Excel 2016.
As an illustrative example, consider a premature male infant born with a GA of 24 weeks and a birth weight within the range of 601–700 g. The 5-min Apgar score was 6. Importantly, intubation was not required during initial neonatal resuscitation in the delivery room. By inputting these parameters into the outcome estimator, we ascertained the following probabilities: 20% likelihood of early mortality, 35% likelihood of severe IVH, and 44% likelihood of early poor outcomes (Table 3).
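For orientation, the general form of a logistic regression estimator with the four selected predictors is sketched below. This is the standard logistic model, not a reproduction of the fitted equations: the coefficient symbols are placeholders for the values in Table 2, and the way each predictor is coded (for example, BBW as a weight range and ETT as a 0/1 indicator) is an assumption.

```latex
P(\text{outcome}) = \frac{1}{1 + e^{-z}}, \qquad
z = \beta_0
  + \beta_{\mathrm{GA}}\,\mathrm{GA}
  + \beta_{\mathrm{BBW}}\,\mathrm{BBW}
  + \beta_{\mathrm{Apgar5}}\,\mathrm{Apgar5}
  + \beta_{\mathrm{ETT}}\,\mathrm{ETT}
```

One such equation, with its own intercept and coefficients, would apply to each of the three target outcomes.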

Discussion
In this study, we used a nationwide retrospective database comprising data on VLBW preterm infants and their associated variables collected immediately after their initial management in the delivery room. Our objective was to develop a predictive model for early mortality, severe IVH, and early poor outcomes using an ML approach.
Following the application of this approach, we identified GA, BBW, 5-min Apgar score, and intubation in the delivery room as the four most important factors for constructing prediction models. Notably, we found that the logistic regression and neural network models demonstrated superior performance, as indicated by their higher AUC values, which suggests a better ability to discriminate between outcomes. Additionally, these models were well calibrated, meaning that the predicted probabilities aligned closely with the observed frequencies of outcomes. Moreover, they were validated in an independent cohort within this study, which supports their robustness. Together, high AUC values, good calibration, and successful external validation make these models reliable predictors of outcomes in this study.
Currently available scoring systems for predicting early mortality in neonates include the Clinical Risk Index for Babies (CRIB) II 23, the Score for Neonatal Acute Physiology Perinatal Extension II (SNAPPE-II) 24, and the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) calculator 25 for neonatal conditions or outcomes. These prediction models have been widely employed and subjected to external validation in multiple studies 26. In our research, similar to CRIB II and NICHD, we identified GA and BBW as significant risk factors. A systematic review underscored the significance of these risk factors in neonatal mortality in neonatal intensive care units, with GA and BBW emerging as the most frequently cited contributors to neonatal mortality 27. Additionally, an investigation conducted on the Taiwanese population, using data from birth certificates and death registries, established a robust correlation between GA, BBW, and the incidence of early mortality 28.
In 1952, Dr. Virginia Apgar pioneered the development of a scoring system designed to evaluate the physical condition of newborns and gauge their need for resuscitative interventions. Her groundbreaking work revealed a significant correlation between neonatal survival up to 28 days of age and the infant's condition at delivery 29. Notably, contemporary research has substantiated the enduring relevance of the Apgar score, reaffirming its significance nearly five decades later 30. Although the Apgar score was initially conceived to assess term infants during an era characterized by high neonatal mortality rates among preterm infants, a recent investigation showed that the relative risk of neonatal mortality consistently escalates as the Apgar score diminishes across all GA categories 31. Accordingly, we included the Apgar score as a pivotal variable for outcome prediction in our study.
In our study, intubation emerged as the most important variable among all initial management procedures conducted in the delivery room. Notably, corroborative research conducted in countries such as Korea 32, Iran 33, Thailand 34, and Brazil 35 has similarly identified intubation as a pivotal risk factor for neonatal outcomes.
In our study, antenatal steroid administration and multiple births did not demonstrate statistical significance as variables for outcome prediction despite their inclusion in the NICHD calculator.This discrepancy may be attributed to the high prevalence of antenatal steroid administration in Taiwan, where 85% of the patients in our study received this treatment, in contrast to the population encompassed by the NICHD calculation, where approximately 70% received antenatal steroids.These demographic differences within the study population may have attenuated the influence of these variables on study outcomes.
In contrast, Boghossian et al. 36 reported that the beneficial effects of antenatal steroids on mortality were statistically significant, primarily in infants born between 24 and 25 weeks of gestation.This observation suggests that the efficacy of antenatal steroids in reducing mortality may be contingent on GA.
Multiple births were associated with a notably elevated risk of mortality, particularly among extremely premature infants born at 26 weeks of gestation or earlier, as indicated in prior research 37. In our study cohort, where the mean GA of the infants was 28.7 weeks, this characteristic may explain why antenatal steroid administration and multiple births were not significant factors in our analysis.
ML is a subset of artificial intelligence that has been extensively used in healthcare 38. According to a recent systematic review 39 concerning the deployment of ML models for forecasting neonatal mortality, prominent ML algorithms include neural networks, logistic regression, and random forests. The reviewed articles collectively reported AUC values ranging from 58.3% to 97.0%, with the average exceeding 70%. These findings underscore the ability of ML models to predict neonatal mortality. The AUC values of our ML-based predictive models were comparable to those of other ML-based models.
In the context of predicting IVH, it is noteworthy that all four variables incorporated into our predictor previously demonstrated strong predictive capabilities for IVH, with particular emphasis on GA. Furthermore, the significance of endotracheal tube ventilation has been underscored in the literature. Additionally, our IVH predictor compares favorably with previous models (AUC 0.67–0.85 for severe IVH prediction) 40.
Notably, despite external validation of the CRIB II, SNAPPE-II, and NICHD prediction models in diverse study populations, none of these models incorporated data from the Taiwanese population into their assessments. Predictive methodologies rely heavily on epidemiological population data to predict specific outcomes 41. It is important to emphasize that the utility of a predictive model may be compromised if the data on which it was built have become outdated by the time it undergoes validation.
To the best of our knowledge, this is the first initiative to construct outcome-predictive models based on the most current and comprehensive datasets available in Taiwan. Moreover, our model can predict early mortality, severe IVH, and early poor outcomes in VLBW preterm infants immediately following their initial management in the delivery room. Remarkably, this predictive capability was achieved using only four factors, eliminating the need for time-consuming blood sampling. These inherent advantages may facilitate widespread application in the Taiwanese population.

Limitations
This study had several limitations. First, restrictions imposed by the available databases impeded the collection of precise clinical data such as blood pressure, oxygen demand, and comprehensive laboratory data encompassing hemograms, biochemical markers, and blood gas analyses. The inclusion of these clinical parameters could potentially enhance the predictive performance of the model 26,39. Second, for privacy protection, the Taiwan neonatal network database recorded anonymous information, with gestational age rounded down and birth body weight recorded in ranges. These unavoidable limitations may impact the collinearity between variables. Third, while our prediction models demonstrated a high degree of accuracy in forecasting outcomes, they lack adaptability over time. As clinical dynamics evolve, these models may experience a decline in predictive accuracy. Fourth, variations in management and procedures across institutions may introduce potential biases that could be unavoidable in our study. Fifth, it is important to acknowledge that ML models may inadvertently manifest bias and discriminatory tendencies. Therefore, additional external validations across diverse population groups are required. This validation should explore whether the model generated can be applied with equal efficacy to populations other than Taiwanese cohorts to ensure a broader range of applicability.

Conclusions
In this study, we developed an outcome predictor designed to predict early mortality, severe IVH, and early poor outcomes in preterm VLBW infants. This predictive model relied on the assessment of four readily available factors immediately after birth: GA, BBW, 5-min Apgar score, and endotracheal tube ventilation during initial resuscitation. Our analysis has yielded a formula that demonstrates exceptional performance, as evidenced by the high AUC values in both the internal validation cohort and the independent external validation population. Furthermore, it is well-calibrated, as evaluated by calibration plots and mean Brier scores. This prediction formula may prove to be a valuable tool and provide essential prognostic information for parents, aiding them in making informed decisions regarding the care and future of VLBW preterm infants. Furthermore, it may offer healthcare providers valuable guidance and facilitate the formulation of effective decision-making strategies for the clinical management of vulnerable infants. However, further validation across diverse populations is required to ensure broader applicability. Moreover, the inclusion of clinical parameters may further improve model accuracy.

Figure 1. Flowchart of machine learning to build the predictive model.

Figure 2. Radar charts of attribute selection with the information gain attribute evaluator. The top five critical variables on the radar chart are GA, BBW, endotracheal tube ventilation, 5-min Apgar score, and 1-min Apgar score. GA gestational age, BBW birth body weight, ETT endotracheal tube, Apgar 5 5-min Apgar score, Apgar 1 1-min Apgar score, RDS respiratory distress syndrome, BT body temperature, epinephrine epinephrine administration, PPV positive pressure ventilation, CPAP continuous positive airway pressure, E_sepsis early onset sepsis, NRP neonatal resuscitation, PIH pregnancy-induced hypertension, C/S Cesarean section, SGA small for gestational age.

Figure 4. Calibration plot and mean Brier score of six prediction models in the internal validation set. (a) Calibration plot of early mortality. (b) Calibration plot of severe IVH. (c) Calibration plot of early poor outcomes. (d) Mean Brier score of three target outcomes.

Figure 5. ROC curve analysis of three prediction models in the external validation set. (a) ROC of early mortality; (b) ROC of severe IVH; (c) ROC of early poor outcomes. ROC receiver operating characteristic.

Table 1. Demographic data of the participants. GA gestational age, BBW birth body weight, PIH pregnancy-induced hypertension, BT body temperature, RDS respiratory distress syndrome, DRCPAP delivery room continuous positive airway pressure.

with and without severe IVH (p = 0.29) and with and without early poor outcomes (p = 0.20). The discrepancy observed, wherein significant differences were found between each variable and the target outcome in Cohort 1, whereas such differences were not apparent in Cohort 2, could potentially be attributed to the limited sample size of Cohort 2.

Table 2. Intercept and coefficient values of the attributes in various models developed using logistic regression. GA gestational age, BBW birth body weight, ETT endotracheal tube.

Table 3. A table of the early poor outcomes estimator.